Regression with Input-dependent Noise: A Gaussian Process Treatment

Authors

  • Paul W. Goldberg
  • Christopher K. I. Williams
  • Christopher M. Bishop
Abstract

Gaussian processes provide natural non-parametric prior distributions over regression functions. In this paper we consider regression problems where there is noise on the output, and the variance of the noise depends on the inputs. If we assume that the noise is a smooth function of the inputs, then it is natural to model the noise variance using a second Gaussian process, in addition to the Gaussian process governing the noise-free output value. We show that prior uncertainty about the parameters controlling both processes can be handled, and that the posterior distribution of the noise rate can be sampled from using Markov chain Monte Carlo methods. Our results on a synthetic data set give a posterior noise variance that well-approximates the true variance.

1 Background and Motivation

A very natural approach to regression problems is to place a prior on the kinds of function that we expect, and then after observing the data to obtain a posterior. The prior can be obtained by placing prior distributions on the weights in a neural network, although we would argue that it is perhaps more natural to place priors directly over functions. One tractable way of doing this is to create a Gaussian process prior. This has the advantage that predictions can be made from the posterior using only matrix computations for fixed hyperparameters and a global noise level. In contrast, for neural networks (with fixed hyperparameters and a global noise level) it is necessary to use approximations or Markov chain Monte Carlo (MCMC) methods. Rasmussen (1996) has demonstrated that predictions obtained with Gaussian processes are as good as or better than those of other state-of-the-art predictors.

In much of the work on regression problems in the statistical and neural network literatures, it is assumed that there is a global noise level, independent of the input vector $\mathbf{x}$. The book by Bishop (1995) and the papers by Bishop (1994), MacKay (1995) and Bishop and Qazaz (1997) have examined the case of input-dependent noise for parametric models such as neural networks. (Such models are said to be heteroscedastic in the statistics literature.) In this paper we develop the treatment of an input-dependent noise model for Gaussian process regression, where the noise is assumed to be Gaussian but its variance depends on $\mathbf{x}$. As the noise level is non-negative we place a Gaussian process prior on the log noise level. Thus there are two Gaussian processes involved in making predictions: the usual Gaussian process for predicting the function values (the $y$-process), and another one (the $z$-process) for predicting the log noise level. Below we present a Markov chain Monte Carlo method for carrying out inference with this model and demonstrate its performance on a test problem.

1.1 Gaussian processes

A stochastic process is a collection of random variables $\{Y(\mathbf{x}) \mid \mathbf{x} \in X\}$ indexed by a set $X$. Often $X$ will be a space such as $\mathbb{R}^d$ for some dimension $d$, although it could be more general. The stochastic process is specified by giving the probability distribution for every finite subset of variables $Y(\mathbf{x}_1), \ldots, Y(\mathbf{x}_k)$ in a consistent manner. A Gaussian process is a stochastic process which can be fully specified by its mean function $\mu(\mathbf{x}) = E[Y(\mathbf{x})]$ and its covariance function $C(\mathbf{x}, \mathbf{x}') = E[(Y(\mathbf{x}) - \mu(\mathbf{x}))(Y(\mathbf{x}') - \mu(\mathbf{x}'))]$; any finite set of points will have a joint multivariate Gaussian distribution. Below we consider Gaussian processes which have $\mu(\mathbf{x}) \equiv 0$. This assumes that any known offset or trend in the data has been removed. A non-zero $\mu(\mathbf{x})$ is easily incorporated into the framework at the expense of extra notational complexity.
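To make the finite-dimensional view concrete, here is a minimal sketch (ours, not from the paper) of drawing one function from a zero-mean Gaussian process prior: form the covariance matrix of a finite set of inputs and sample the corresponding multivariate Gaussian. The squared-exponential covariance anticipates the form of equation (1) below; the hyperparameter values are illustrative.

```python
import numpy as np

def cov_matrix(X, v=1.0, w=1.0, jitter=1e-6):
    # Squared-exponential covariance (the form of equation (1) below):
    # v sets the overall scale, w the inverse squared length-scale, and
    # the jitter on the diagonal keeps the matrix well-conditioned.
    d2 = np.sum(w * (X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return v * np.exp(-0.5 * d2) + jitter * np.eye(len(X))

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50)[:, None]   # a finite set of 50 inputs
K = cov_matrix(X)                        # covariance of this finite subset
L = np.linalg.cholesky(K)                # K = L L^T
y = L @ rng.standard_normal(len(X))      # one function drawn from N(0, K)
```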
A covariance function is used to define a Gaussian process; it is a parametrised function from pairs of $\mathbf{x}$-values to their covariance. The form of the covariance function that we shall use for the prior over functions is given by

$$C_y(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = v_y \exp\Big( -\frac{1}{2} \sum_{l=1}^{d} w_{yl} \big( x_l^{(i)} - x_l^{(j)} \big)^2 \Big) + J_y\, \delta(i,j) \qquad (1)$$

where $v_y$ specifies the overall $y$-scale and $w_{yl}^{-1/2}$ is the length-scale associated with the $l$th coordinate. $J_y$ is a "jitter" term (as discussed by Neal, 1997), which is added to prevent ill-conditioning of the covariance matrix of the outputs; $J_y$ is typically given a small value, e.g. $10^{-6}$.

For the prediction problem we are given $n$ data points $\mathcal{D} = ((\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \ldots, (\mathbf{x}_n, t_n))$, where $t_i$ is the observed output value at $\mathbf{x}_i$. The $t$'s are assumed to have been generated from the true $y$-values by adding independent Gaussian noise whose variance is $\mathbf{x}$-dependent. Let the noise variance at the $n$ data points be $\mathbf{r} = (r(\mathbf{x}_1), r(\mathbf{x}_2), \ldots, r(\mathbf{x}_n))$. Given the assumption of a Gaussian process prior over functions, it is a standard result (e.g. Whittle, 1963) that the predictive distribution $P(t^* \mid \mathbf{x}^*)$ corresponding to a new input $\mathbf{x}^*$ is $t^* \sim N(\hat{t}(\mathbf{x}^*), \sigma^2(\mathbf{x}^*))$, where

$$\hat{t}(\mathbf{x}^*) = \mathbf{k}_y^T(\mathbf{x}^*)\,(K_y + K_N)^{-1}\,\mathbf{t} \qquad (2)$$

$$\sigma^2(\mathbf{x}^*) = C_y(\mathbf{x}^*, \mathbf{x}^*) + r(\mathbf{x}^*) - \mathbf{k}_y^T(\mathbf{x}^*)\,(K_y + K_N)^{-1}\,\mathbf{k}_y(\mathbf{x}^*) \qquad (3)$$

where the noise-free covariance matrix $K_y$ satisfies $[K_y]_{ij} = C_y(\mathbf{x}_i, \mathbf{x}_j)$, $\mathbf{k}_y(\mathbf{x}^*) = (C_y(\mathbf{x}^*, \mathbf{x}_1), \ldots, C_y(\mathbf{x}^*, \mathbf{x}_n))^T$, $K_N = \mathrm{diag}(\mathbf{r})$, and $\mathbf{t} = (t_1, \ldots, t_n)^T$; $\sqrt{\sigma^2(\mathbf{x}^*)}$ gives the "error bars" or confidence interval of the prediction.

In this paper we do not specify a functional form for the noise level $r(\mathbf{x})$ but we do place a prior over it. An independent Gaussian process (the $z$-process) is defined to be the log of the noise level. Its values at the training data points are denoted by $\mathbf{z} = (z_1, \ldots, z_n)$, so that $\mathbf{r} = (\exp(z_1), \ldots, \exp(z_n))$. The prior for $\mathbf{z}$ has a covariance function $C_z(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$ similar to that given in equation (1), although the parameters $v_z$ and the $w_{zl}$'s can be chosen to be different from those for the $y$-process. We also add the jitter term $J_z\,\delta(i,j)$ to the covariance function for $\mathbf{z}$, where $J_z$ is given the value $10^{-2}$; this value is larger than usual, for technical reasons discussed later. We use a zero-mean process for $\mathbf{z}$, which carries a prior assumption that the average noise rate is approximately 1 (being $e$ to the power of components of $\mathbf{z}$). This is suitable for the experiment described in section 3. In general it is easy to add an offset to the $z$-process to shift the prior noise rate.

2 An input-dependent noise process

We discuss, in turn, sampling the noise rates and making predictions with fixed values of the parameters that control both processes, and then sampling from the posterior on these parameters.

2.1 Sampling the Noise Rates

The predictive distribution for $t^*$, the output at a point $\mathbf{x}^*$, is $P(t^* \mid \mathbf{t}) = \int P(t^* \mid \mathbf{t}, \mathbf{r}(\mathbf{z}))\,P(\mathbf{z} \mid \mathbf{t})\,d\mathbf{z}$. Given a $\mathbf{z}$ vector, the prediction $P(t^* \mid \mathbf{t}, \mathbf{r}(\mathbf{z}))$ is Gaussian with mean and variance given by equations (2) and (3), but $P(\mathbf{z} \mid \mathbf{t})$ is difficult to handle analytically, so we use a Monte Carlo approximation to the integral. Given a representative sample $\{\mathbf{z}_1, \ldots, \mathbf{z}_k\}$ of log noise rate vectors we can approximate the integral by the sum $\frac{1}{k} \sum_j P(t^* \mid \mathbf{t}, \mathbf{r}(\mathbf{z}_j))$.
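With the noise variances $\mathbf{r}$ fixed, equations (2) and (3) reduce to a few linear solves. The sketch below is our illustration, not the paper's code: `cov_y` is the covariance of equation (1) with illustrative hyperparameters, and the test-point noise variance `r_star` is passed in explicitly (in the full method it would itself be predicted from the $z$-process).

```python
import numpy as np

def cov_y(A, B, v=1.0, w=1.0):
    # Squared-exponential covariance C_y of equation (1), without the
    # jitter term, between the rows of A and the rows of B.
    d2 = np.sum(w * (A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return v * np.exp(-0.5 * d2)

def predict(X, t, x_star, r, r_star, jitter=1e-6):
    # Predictive mean and variance, equations (2) and (3), for a single
    # test input x_star, given noise variances r at the training inputs
    # and an assumed noise variance r_star at the test input.
    Ky = cov_y(X, X) + jitter * np.eye(len(X))   # [K_y]_ij = C_y(x_i, x_j)
    KN = np.diag(r)                              # K_N = diag(r)
    ky = cov_y(X, x_star[None, :])[:, 0]         # k_y(x*)
    mean = ky @ np.linalg.solve(Ky + KN, t)      # equation (2)
    var = (cov_y(x_star[None, :], x_star[None, :])[0, 0] + r_star
           - ky @ np.linalg.solve(Ky + KN, ky))  # equation (3)
    return mean, var

# Monte Carlo approximation: with a sample {z_1, ..., z_k} from P(z|t),
# average the k Gaussian predictions obtained with r_j = exp(z_j).
```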
We wish to sample from the distribution $P(\mathbf{z} \mid \mathbf{t})$. As this is quite difficult, we sample instead from $P(\mathbf{y}, \mathbf{z} \mid \mathbf{t})$; a sample for $P(\mathbf{z} \mid \mathbf{t})$ can then be obtained by ignoring the $\mathbf{y}$ values. This is a similar approach to that taken by Neal (1997) in the case of Gaussian processes used for classification or robust regression with $t$-distributed noise. We find that

$$P(\mathbf{y}, \mathbf{z} \mid \mathbf{t}) \propto P(\mathbf{t} \mid \mathbf{y}, \mathbf{r}(\mathbf{z}))\,P(\mathbf{y})\,P(\mathbf{z}). \qquad (4)$$

We use Gibbs sampling to sample from $P(\mathbf{y}, \mathbf{z} \mid \mathbf{t})$ by alternately sampling from $P(\mathbf{z} \mid \mathbf{y}, \mathbf{t})$ and $P(\mathbf{y} \mid \mathbf{z}, \mathbf{t})$. Intuitively, we are alternating the "fitting" of the curve (the $y$-process) with the "fitting" of the noise level (the $z$-process). These two steps are discussed in turn.

• Sampling from $P(\mathbf{y} \mid \mathbf{t}, \mathbf{z})$. For $\mathbf{y}$ we have that

$$P(\mathbf{y} \mid \mathbf{t}, \mathbf{z}) \propto P(\mathbf{t} \mid \mathbf{y}, \mathbf{r}(\mathbf{z}))\,P(\mathbf{y}) \qquad (5)$$

where

$$P(\mathbf{t} \mid \mathbf{y}, \mathbf{r}(\mathbf{z})) = \prod_{i=1}^{n} \frac{1}{(2\pi r_i)^{1/2}} \exp\Big( -\frac{(t_i - y_i)^2}{2 r_i} \Big). \qquad (6)$$

Equation (6) can also be written as $\mathbf{t} \mid \mathbf{y}, \mathbf{r}(\mathbf{z}) \sim N(\mathbf{y}, K_N)$. Thus $P(\mathbf{y} \mid \mathbf{t}, \mathbf{z})$ is a multivariate Gaussian with mean $(K_y^{-1} + K_N^{-1})^{-1} K_N^{-1} \mathbf{t}$ and covariance matrix $(K_y^{-1} + K_N^{-1})^{-1}$, which can be sampled by standard methods, as sketched below.
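A minimal sketch of this $y$-step (ours, not the paper's code; it uses explicit matrix inverses for clarity rather than numerical robustness, and `Ky` is the noise-free covariance matrix with jitter, built as in equation (1)):

```python
import numpy as np

def sample_y(Ky, t, z, rng):
    # One Gibbs step for the y-process: draw y from P(y|t,z), the
    # multivariate Gaussian with mean (Ky^-1 + KN^-1)^-1 KN^-1 t and
    # covariance (Ky^-1 + KN^-1)^-1, where KN = diag(exp(z)).
    KN_inv = np.diag(np.exp(-z))                   # K_N^{-1}, since r_i = exp(z_i)
    C = np.linalg.inv(np.linalg.inv(Ky) + KN_inv)  # posterior covariance
    mean = C @ (KN_inv @ t)                        # posterior mean
    L = np.linalg.cholesky(C)                      # C = L L^T
    return mean + L @ rng.standard_normal(len(t))  # one draw from N(mean, C)
```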
• Sampling from $P(\mathbf{z} \mid \mathbf{t}, \mathbf{y})$. For fixed $\mathbf{y}$ and $\mathbf{t}$ we obtain

$$P(\mathbf{z} \mid \mathbf{y}, \mathbf{t}) \propto P(\mathbf{t} \mid \mathbf{y}, \mathbf{z})\,P(\mathbf{z}). \qquad (7)$$

The form of equation (6) means that it is not easy to sample $\mathbf{z}$ as a vector. Instead we can sample its components separately, which is a standard Gibbs sampling algorithm. Let $z_i$ denote the $i$th component of $\mathbf{z}$ and let $\mathbf{z}_{-i}$ denote the remaining components. Then ...

Related articles

Gaussian Process Regression Networks

We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive dis...

Regression with Input-Dependent Noise: A Bayesian Treatment

In most treatments of the regression problem it is assumed that the distribution of target data can be described by a deterministic function of the inputs, together with additive Gaussian noise having constant variance. The use of maximum likelihood to train such models then corresponds to the minimization of a sum-of-squares error function. In many applications a more realistic model would all...

Gaussian Process Regression with Heteroscedastic or Non-Gaussian Residuals

Abstract Gaussian Process (GP) regression models typically assume that residuals are Gaussian and have the same variance for all observations. However, applications with input-dependent noise (heteroscedastic residuals) frequently arise in practice, as do applications in which the residuals do not have a Gaussian distribution. In this paper, we propose a GP Regression model with a latent variab...

Speech Enhancement Using Gaussian Mixture Models, Explicit Bayesian Estimation and Wiener Filtering

Gaussian Mixture Models (GMMs) of power spectral densities of speech and noise are used with explicit Bayesian estimations in Wiener filtering of noisy speech. No assumption is made on the nature or stationarity of the noise. No voice activity detection (VAD) or any other means is employed to estimate the input SNR. The GMM mean vectors are used to form sets of over-determined system of equatio...

An Adaptive Hierarchical Method Based on Wavelet and Adaptive Filtering for MRI Denoising

MRI is one of the most powerful techniques for studying the internal structure of the body. MRI image quality is affected by various noises. Noise in MRI is usually thermal, mainly due to the motion of charged particles in the coil. Noise in MRI images also limits the visual study of the images as well as their computer analysis. In this paper, first, it is proved that proba...


Publication date: 1997